Skip to content

Conversation

@jjonescz
Copy link
Member

@jjonescz jjonescz commented Nov 28, 2024

@ghost ghost added Area-Compilers untriaged Issues and PRs which have not yet been triaged by a lead labels Nov 28, 2024
@jjonescz jjonescz added Documentation Area-Compilers and removed Area-Compilers untriaged Issues and PRs which have not yet been triaged by a lead labels Nov 28, 2024
@jjonescz jjonescz requested review from a team, AlekseyTs and jaredpar November 28, 2024 12:07
@jjonescz jjonescz changed the title Add spec for the utf8-string-literal-encoding feature flag Add spec for the data section string literals feature flag Nov 28, 2024
@jcouv jcouv self-assigned this Dec 1, 2024
@jjonescz jjonescz requested a review from jcouv December 4, 2024 20:26
@jjonescz jjonescz requested a review from AlekseyTs December 13, 2024 12:35
@AlekseyTs
Copy link
Contributor

@jjonescz Consider updating PR's title. The "feature flag" is just part of the feature, it is not the feature.

@AlekseyTs
Copy link
Contributor

Done with review pass (commit 10)

@jjonescz jjonescz changed the title Add spec for the data section string literals feature flag Add spec for the data section string literals feature Dec 14, 2024
Copy link
Contributor

@AlekseyTs AlekseyTs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (commit 11)

Copy link
Member

@jcouv jcouv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM Thanks (iteration 11)

@teo-tsirpanis
Copy link
Contributor

Would be interesting to extend this to encoding constant strings being passed to Convert.FromBase64String, as raw binary data.

The protobuf compiler for example generates such strings. And BTW if I replace string.Concat with +s the compiler gets stuck.

@AlekseyTs
Copy link
Contributor

AlekseyTs commented Dec 26, 2024

@teo-tsirpanis

Would be interesting to extend this to encoding constant strings being passed to Convert.FromBase64String, as raw binary data.

Could you please elaborate what would be an advantage for doing this? Also, how does this align with the specific goal that we are trying to achieve here, which is not a goal of supporting different encodings for string literals?

@jjonescz jjonescz merged commit f617688 into dotnet:main Jan 15, 2025
5 checks passed
@jjonescz jjonescz deleted the DataSectionStringLiterals-spec branch January 15, 2025 16:59
@dotnet-policy-service dotnet-policy-service bot added this to the Next milestone Jan 15, 2025
@dibarbet dibarbet modified the milestones: Next, 17.14 P1 Jan 28, 2025

For every unique string literal, a unique internal static class is generated which:
- has name composed of `<S>` followed by a hex-encoded XXH128 hash of the string
(collisions [should not happen][xxh128] with XXH128 and so they aren't currently detected or reported, the behavior in that case is undefined),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a security bug that the collision is undetected?

The collisions can happen with some very small but non-zero probability. If a program is unlucky to hit the collision, the string literals will be messed up and the behavior of the program will be impacted in interesting ways. This can be used by adversaries to contribute an innocently looking string to an open-source project where the string is carefully crafted to cause a collision with some other critical string.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few places in the compiler where we use hashes and collisions could lead to program behavior changes. These are part of our security reviews in the past and this will be included in our next one. Thus far nothing has risen to the level of forcing extra checking. There is an increase in non-cryptographic hashing support though and I will make sure that is included in the next security review. @RikkiGibson

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, XXH128 is non-cryptographic hash, so one should assume that it is easy to find collisions. If we are going to keep the current implementation for this feature, I would be curious to know how we are justifying that it is ok.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understand the concern here. But you can create similar collisions today at much lower level places in the compiler by creating collisions with file contents. The use of non-crypto hashes was approved at that layer too (was in fact suggested by security team). The general reasoning is that once you can control code being submitted to the project you have lots of avenues for subverting expectations.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The general reasoning is that once you can control code being submitted to the project you have lots of avenues for subverting expectations.

In the open-source world, it is important that the project maintainers are able to review the changes with confidence.

For example, let's say that this feature gets enabled for some part of OpenTelemetry project and somebody submits a PR that includes a code like this:

void DiagnosticThatIsTurnedOffByDefault()
{
   // Random unique string to help us identify blah
   DoStuff("ruroiodffdshofps09rw403492049358302849052peldfjskdfshgfdshiofofjsekljhkdkjafhkds3");
}

How can the maintainer of the OpenTelemetry project tell that this change is actually planting a backdoor into an unrelated code due to a hash collision?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xxHash128 is also used to identify source documents in the IDE currently, as well as to identify interceptable calls in source code. We consider it collision-safe for these purposes. See also the Examples at the bottom of this doc: https://web.archive.org/web/20240223124645/https://fossies.org/linux/xxHash/tests/collisions/README.md

@CyrusNajmabadi in case you have further thoughts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xxHash128 is also used to identify source documents in the IDE currently

I would not be worried about the IDE case from the security angle. If somebody crafts a file name with a hash collision and it makes the IDE to misbehave or crash, it is not a big deal. (You may be scratching your head for a while if you are tasked with investigating the crash report.)

Copy link
Member Author

@jjonescz jjonescz Feb 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I think it should be straightforward to add collision detection, it just didn't seem straightforward to test that, but I can look into that more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I think hash collisions might not be a problem - #77061 (comment)


### Runtime support

We could emit the strings to data fields similarly but we would not synthesize the `string` fields and static constructors.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should make this happen if you plan to advertise this feature for broader use, and not just as a workaround for the few folks who run into the limit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The plan is to use the current state of implementation, compiler only trick, to get feedback on the scenarios. We did a number of benchmarks on the work and didn't see meaningful differences that justified preemptively doing runtime work. Decided that we would ship it out as experimental, get real world feedback from teams like Bing, and then let that drive what we wanted the final state of the feature to be before we remove experimental lable

@davidwrighton, @DamianEdwards

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants